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Abstract 

'We discuss the advantages of paralleHzation by muhithreading on graphics processing units (GPUs) for parallel tempering Monte 
Carlo computer simulations of an exemplified bead-spring model for homopolymers. Since the sampling of a large ensemble 
of conformations is a prerequisite for the precise estimation of statistical quantities such as typical indicators for conformational 
transitions like the peak structure of the specific heat, the advantage of a strong increase in performance of Monte Carlo simulations 
cannot be overestimated. Employing multithreading and utilizing the massive power of the large number of cores on GPUs, being 
available in modern but standard graphics cards, we find a rapid increase in efficiency when porting parts of the code from the 
central processing unit (CPU) to the GPU. 

'Keywords: GPU, CUDA, Monte Carlo simulations, parallel tempering, polymers, structural transitions 
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1. Introduction 

I Computer simulations have become a fundamental pillar in 
physics. In particular, computer simulations are frequently the 
only choice to understand physical properties of complex and 
cooperative behavior of systems which require a detailed mod- 
"eling. This is certainly apparent in structural biophysics and 
polymer physics, where effective many-body interactions and 
'disorder eff'ects cannot be tackled by means of analytical ap- 
proaches alone. Even for simplistic models, computation time 
can become interminable if a large amount of data is needed as, 
e.g., in statistical physics. 

Despite large advances in the design of central processing 
unit (CPU) architectures most of the above-mentioned needs 
'could not be met by simulations on single CPU systems. Sev- 
eral approaches to speed up simulations have been employed, 
e.g., parallel computing using message passing on clusters or 
multithreaded programming on multicore CPUs. 
' Graphics processing units (GPUs) have become very power- 
ful, in recent years, driven by the professional computer gam- 
ing industry. GPUs possess a massively parallel architecture. 
With the latest release of NVIDIA's convenient programming 
language CUDA, GPUs have become popular in scientific com- 
puting. GPU computing finds its application in many fields, 
such as astronomy y], |21], medicine 131 14|], time series analy- 
sis for financial markets ||51], molecular dynamics simulations 
Slltl, Monte Carlo studies of spin systems lililli^l, and Quan- 
tum Monte Carlo applications llllll . We are interested in the 
thermodynamical properties of polymer models, both on lattice 
I H [11 and ofi^-lattice ([ll[il. Previous studies |fl5i[l6l[l7|] of 
an elastic polymer model revealed a complex, chain-length de- 
pendent structural transition behavior. For relatively high tem- 
peratures, the polymer chain has a wide spread coil-like struc- 
ture. At the ©-point - where monomer-monomer attraction and 



repulsion by volume exclusion is just balanced - the polymer 
collapses from the random coil to more globular structures. In 
this globular "phase", there is no internal structure. This is com- 
parable to a liquid. At even lower temperatures, sort of a freez- 
ing transition is observed. 

The purpose of this paper is to show that GPU simulations 
can also quite efficiently be performed for off-lattice polymer 
models without any need of highly sophisticated tricks of im- 
plementation. By employing a straightforward implementation, 
of massive paralleHzation provided by GPUs, we investigate 
the possible speed-up for replica-exchange Monte Carlo sim- 
ulations of an off-lattice model for elastic polymers. 

The paper is organized as follows. Section |2] describes de- 
tails of the GPU architecture and CUDA. In Section [3] we give 
a brief introduction of the investigated model and the simula- 
tion technique used. The results of our studies are presented in 
SectionH] A summary of our findings in given in Section|5] 



2. General-Purpose Computation on Graphics Processing 
Units 

Since massively parallel general-purpose computation on 
GPUs is not standard, despite the large number of applications 
in the past few years, let us review the main features of multi- 
threaded GPU architectures and the most frequently used spe- 
cific language CUDA. 

2.1. GPU Architecture 

A GPU is composed of a number of streaming multiproces- 
sors (SM) with on-chip shared memory only visible to that SM 
and a large global memory, often with sizes in the range of 
1 - 4GB, in today's graphics card architectures. The kernel 
- the main function of a GPU program - runs the same code 
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Figure 1: Grids with tliread blocks. 



in parallel on a number of threads given by the grid and block 
layout. 

The grid of independent thread blocks (Fig. [1) can be or- 
dered in one or two dimensions. It is possible to launch the 
same kernel or others with a differing grid layout within the 
same program successively, with the Fermi cards even con- 
currently. Threads, i.e., individual processes, can be arranged 
within these blocks in up to three dimensions with a maximum 
size of 512 X 512 X 64 on NVIDIA's GT200-based cards or 
1024 X 1024 X 64 on Fermi-based cards. However, it is also 
limited to 512 or 1024 threads per block, respectively. Each 
thread can be identified by a unique id, which can be calcu- 
lated from its coordinates within the grid. At execution time, 
32 threads are grouped together in a warp. Those warps are 
then assigned to a SM. For maximum performance, it is best 
to start many more threads than cores exist on the device. The 
same program can run on a wide range of CUDA-capable de- 
vices with different number of cores, because the distribution 
of warps to cores is done on the device. This feature is called 
transparent scalability lIlSll . Also, if certain threads within a 
warp are reading data from global memory, the device is able 
to hide the memory latency by executing a warp of threads that 
does not have to wait for another operation. 

Graphics cards come with several types of memory (Fig. |2]i 
accessible to the program. The largest is the global memory, 
which can be as large as 4GB in professional special purpose 
cards like the Fermi series from NVIDIA. The global memory 
can be read and written by specific functions from the CPU - 
also called host -, and every thread on the GPU - also called 
device - has read and write access to global memory. This large 
memory is necessary since it is not possible for the device to 
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Figure 2: Memory layout on a GPU device. 

access the RAM of the host. This means all data that needs to 
be processed by the GPU has to be copied to the device for cal- 
culation and copied back to the host for evaluation. A downside 
of the global memory is its high latency. 

The constant memory supports short latency read-only ac- 
cess by the device if all threads read from the same location in 
memory. For specific data types, the texture memory is avail- 
able. Registers and shared memory are fast on-chip memories. 
Access to these types of memory is usually a lot faster than 
global memory. Registers are assigned per thread and only ac- 
cessible by that thread. Shared memory is allocated to a thread 
block within a SM and all threads in this block can read and 
write to that memory. The number of available registers is lim- 
ited to 16384 on GT200-based chips and is twice as large on 
new Fermi-based cards. The size of the shared memory was 
increased from 16KB on GT200 to 48KB on Fermi cards 111 9:1. 



2.2. Cotnpute Unified Device Architecture: CUDA 

With the release of NVIDIA's set of programming tools 
(CUDA) in 2007, the exploitation of the potential of GPUs has 
become more feasible. The CUDA toolkit comes in two vari- 
ants, a high-level C/C-n- interface to GPU functions, the so- 
called CUDA runtime API and a more low-level programming 
layer, the CUDA driver API, which is closer to hardware. The 
toolkit contains a set of extensions to the C programming lan- 
guage to accomplish the most common tasks in GPU program- 
ming, like memory management and operations, but also new 
data types for mathematical calculations. For a detailed descrip- 
tion of the CUDA programming language, see i20ll . The draw- 
back is that CUDA currently only runs on NVIDIA hardware. 
However, there is a vendor-independent alternative in active 
development, called OpenCL (Open Computing Language^ 
Its programming interface is comparable to the driver API of 
NVIDIA's CUDA. 



http://www.khronos.org/opencl/ 
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Figure 3: Sequence of a parallel GPU program. 

Since at this time CUDA seems more mature than OpenCL 
and programming the CUDA runtime API is much more con- 
venient than low-level programming, this is what we chose for 
our implementation. 

A scheme of the sequence of a CUDA parallel program is 
shown in Fig. [3] The program starts on the CPU like any 
other program with the initialization of variables and data. With 
CUDA one also has to allocate and initialize all memories that 
are needed for the calculations on the GPU. When all data is 
copied to the device, the kernel is started, and the program now 
spreads into parallel threads running on the GPU. The execu- 
tion of the kernel on the GPU is completely independent from 
the calling program, which could proceed in its own execution. 
For this reason a synchronization barrier has to be implemented 
in the main program to wait for the calculations on the device to 
finish. After the kernel finishes its execution, the results of the 
computations are copied back to the host memory for further 
processing. 

2.3. Random Number Generation 

Monte Carlo simulations require a multitude of random num- 
bers. In our implementation, the required amount of random 
numbers for a single kernel call is generated on the CPU and 
copied to global memory before the actual execution of the ker- 
nel. Each thread block then copies the needed numbers to its 
shared memory during the execution. This is not the most opti- 
mal solution, because the memory transactions between global 
and shared memory slow down the overall execution time. It 
would be more efficient to run an independent random num- 
ber generator on each thread block that fits entirely into shared 
memory. With the advancing size of the shared memory on the 
device, this is an option for future implementations. 

3. GPU Simulations of an Elastic Flexible Polymer Model 

3.1. Polymer Model 

As an example for a molecular system, we consider an elas- 
tic, flexible bead-spring homopolymer chain. All monomers 
interact via a shifted and truncated pairwise Lennard-Jones po- 
tential: 

E'lfinj) = £u(min(r,v, r,)) - E^,(r,), (1) 



Euirij) = 4e 



(2) 



where r,j is the distance between two monomers / and j. The 
Lennard-Jones parameters are e = 1 and cr — 2"'^^ro, the min- 
imum of the potential is at ro = 0.7 and the cutoff' radius is 
re = 2.5cr. To model the bonds between adjacent monomers, 
we use the finitely extensible nonlinear elastic (FENE) anhar- 
monic potential 
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(3) 



which has its minimum also at ro and diverges for r —^ ro + R 
with R - 0.3. The spring constant K for the FENE bonds equals 

40 in El- 

The total energy of a conformation C - (ri, . . . , Yn) is then 
given by 



, N N-l 



(4) 






The conformational behavior of elastic polymers in this model 
has already been investigated in detail in Refs. il5ill6ll . 



3.2. Monte Carlo Update 

The local update used throughout the simulation is the fol- 
lowing: A random monomer of the chain is picked and a shift 
of its coordinates by a random vector is proposed. The com- 
ponents of this displacement vector are uniformly distributed 
random numbers within the interval [-0.01, 0.01]. 

For a given inverse thermal energy fi - I /ksT at tempera- 
ture T, this proposal is accepted or rejected according to the 
standard Metropolis |2 ll] criterion 



p = min (1 , exp [-/?(£„„ - E^)]) , 



(5) 



where E,,^^ is the energy of the proposed new conformation and 
Em the energy of the original one. 

3.3. Replica-exchange Method 



In the replica-exchange parallel tempering 11221 I23L |24|] 
method, a simulation of «,. copies, i.e., replicas of the same sys- 
tem, is run at different temperatures. After a certain number of 
Monte Carlo updates an attempt to exchange the conformations 
of neighboring replicas / and /+ 1 is performed. The probability 
to accept such an exchange is given by 



p = min(l,exp[(£',-£',+i)(/3,-y8,+i)]). 



(6) 



This heats up and cools down every copy of the system, which 
helps to avoid barriers in the free-energy landscape and to re- 
duce autocorrelation times. Thus in principle, the effective 
statistics can be increased. 

In this study we show that parallel tempering can quite effi- 
ciently be run on massively parallelized architectures, such as 
CPUs. 



3.4. CUDA Implementation 

In this section, we will focus the attention on implementation 
details. The following listings contain CUDA specific syntax 
emphasizing the most relevant parts of a CUDA program. All 
calculations are performed using single-precision floating-point 
operations, because support for double-precision is not gener- 
ally available. Double-precision performance is to be further 
improved in future chip generations. Listing [1] shows how the 
main function of a GPU program is invoked. First, one needs 
to set up the dimensions for the grid of threads. In this case, the 
dimension of the grid (dimGrid) is set to the number of replicas. 
This means that every replica of the systems runs independently 
in its own thread block. The size of such a thread block dim- 
Block is set to a constant BS which depends on the number of 
beads in the polymer. Details will be discussed in Section l4!2l 



dims dimGrid (NCONFS) ; 
dims dimBlock(BS) ; 
run«<dimGrid ,dimBlock»> 

(d.confs , d.rnds , d.energies , 
cudaThreadSynchronize () ; 



d.rees , d.rgyrs ) ; 



Listing 1 : Specification of the thread layout and kernel call within the main 
program. 

After setting up the grid layout, the kernel run is called with 
the given layout embraced by triple chevrons and a list of argu- 
ments. The same code is then executed by each thread. 

Because a kernel function call returns immediately, it is nec- 
essary to call cudalhreadSynchronizeO as a synchronization 
barrier. Thus the CPU waits for the GPU to finish the calcula- 
tions before proceeding. In our implementation the exchange 
of replicas is done on the CPU, while the GPU is used for the 
Metropolis algorithm with the expensive energy calculation. A 
pruned record of the insides of the kernel is shown in Listing|2] 



.global., void run 
(Polymer* d.confs, float* d.rnds, 
float* d.rees, float* d.rgyrs) 



float* d.energies , 



int id = blockldx.x; 
..shared.. Polymer ps ; 
// . . initialization of some variables 
if (TX == 0) ( 

ps = d.confs [ id ] ; 
I 
..synctfireadsO ; 

while (n < NSWEEPS) I 
oneSweep(&ps , rnds , n) ; 
ener = energy(&ps) ; 
if (TX == 0) I 

Ree = endToEndDistance(&ps) ; 

Rgyr = radiusOfGyration(&ps) ; 

d.energies [n+offset ] = ener; 

d.rees [n+offset ] = Ree; 

d.rgyrs [n+offset ] = Rgyr; 

n++; 
I 



-syncthreadsO ; 
I 

if (TX == 0) I 
d.confs [ id ] = ps ; 

I 
..synctfireadsO ; 



Listing 2: The kernel function - showing the usage of shared memory and the 
main work loop. TX is the unique id for each thread. 

In line 5 of Listing |2l the index of the current thread block is as- 
signed to an integer variable and is used to link this thread block 



to a specific replica. The shared memory for a local copy of the 
replica is allocated in line 6. This means that all threads within 
this block, and only those, have fast access to the copy. The 
actual copy process is shown in lines 6-11, where only one 
thread is assigned to copy the polymer from global to shared 
memory. The barrier in line 1 1 lets all other threads of the block 
wait for the copying to finish, before proceeding with the actual 
calculation. In line 15, the energy calculation is invoked. The 
details of the parallel implementation and CUDA specifics are 
explained in Listings|3]and|4] Again, only one thread is used in 
lines 16 - 23 to collect statistics. The local copy of the polymer 
is copied back to global memory in line 27 for further evalua- 
tion on the CPU. 

Due to the fact that every thread executes the same code, it 
is possible to insert conditions based on the id of the thread to 
alter what each thread actually calculates. 

Since in this model there are only pairwise interactions be- 
tween monomers, it is possible to parallelize the calculation of 
the energy. For the FENE part of the potential this is particu- 
larly straightforward, because only next neighbors are involved, 
see Listing |3] 



if (TX < N-1) ( 

r = distance(p, TX, TX+1); 

energy[TX]+=-0.5tK*R*R...logf (1 -((r-rO) /R) *(( r-rO) /R) ) ; 
I 
..synctfireadsO ; 



Listing 3: Calculation of pairwise FENE interactions. 

For the Lennard-Jones part of the potential, it is not that trivial 
to get the indices of all possible neighbors. The calculation of 
the relevant indices is included in Listing |4] 



for (int i=0; i<N/2; !++) ( 
if (TX < N) I 
if (TX > 1) I 
indexl = i ; 
index2 = TX; 
i 
else I 

indexl = N-2-i ; 
index2 = index1+1+TX; 



r = distance(p, 
if (r<rc && ( I 

float rs6 = . 

energy [TX] += 



-synctfireads () ; 



indexl , index2 ) ; 
1= N-2-i )) { 
.powf (sigma/r , 6. Of); 

4*epsilon*((rs6*(rs6-1))-E.lj.rc); 



Listing 4; Calculation of pairwise Lennard-Jones potential. 

For both parts of the potential, one pair of monomers is as- 
signed to one thread to perform the actual calculation of the 
energy. The results of those calculations are stored in the array 
energy. When all threads are finished with their part a parallel 
reduction is performed on this array to obtain the total energy. 
Instead of using a single thread and a loop for the summation, 
multiple threads calculate different parts of the sum. See 12511 
for details. 

4. Results 



4.1. Thermodynamics 

Before discussing details of the efficiency of the GPU simu- 
lations, let us first briefly review the thermodynamic al proper- 



Table 1 : Specifications of the used liardware. 





reference CPU 


GPUl 


GPU2 


GPU3 


name 


Xeon E5620 


Tesla CI 060 


GTX285 


GTX480 


# processors 

# cores per processor 
RAM 


1 

4 (only 1 used) 

16384MB 


30 

8 

4096MB 


30 

8 

1024MB 


15 

32 

1536MB 


clock speed 


2.4GHz 


1.3GHz 


1.48GHz 


1.4GHz 


max. threads per block 


- 


512 


512 


1024 


shared memory size 


- 


16kB 


16KB 


48kB 


registers per block 


- 


16384 


16384 


32768 



ties of elastic polymers with 13 and 55 monomers, which non- 

trivially freeze into icosahedral structures at low temperatures. 

Figure m shows the specific heat of the polymers, given by 



C 



A^ N dT N 



(7) 



There is a change in the monotonic behavior of the curve at 
approximately T - l.Q for the 13mer and approximately T = 
1 .5 for the 55mer. This is an indicator for a structural change 
within the polymer chain, in this case the 0-collapse, which 
describes the finite-system analog of the phase transition from 
random-coil conformations to globular shapes. 

The very distinct peak at low temperatures, T x 0.33 for the 
13mer and T x 0.32 for the 55mer, is a sign for the freezing 
transition. Below this temperature, the crystalline polymer has 
an icosahedral structure. These results conform with previous 
studies HI [Tell. 

Figure |5] shows the fluctuation of the squared radius of gyra- 
tion, given by 

1 ^ 

'"gyr = T^ /j^''* " '™<=an)^, (8) 



A=l 



where rmean is the center of mass of the polymer. The radius of 
gyration describes the mean distance of every monomer to the 



center of the polymer From its fluctuations, structural transi- 
tions can be identified. The peaks for the freezing transition ap- 
proximately coincide with the peaks in the specific heat. Much 
more distinct are the peaks that indicate the 0-collapse at higher 
temperatures, compared to the corresponding weak signals in 
the specific heat. 

4.2. Performance comparison 

To compare the performance of each implementation, we de- 
fine a speed-up factor as follows. 



fCPU 

fcpu 



(9) 



where fcpu is the execution time on a single CPU core and fcpu 
is the runtime on the GPU. All runtimes were measured with 
CUtil-timers, which are wrapper functions to the standard C li- 
brary call gettimeofday. This ensures a consistent time mea- 
surement on a variety of systems, and to measure CPU and 
GPU time alike. On the CPU side, only the time taken by the 
actual calculation was measured. No initialization of variables 
or any file operations were included in the time measurement. 
To consider the extra overhead which comes with GPU com- 
puting, the time taken to copy data hence and forth the device 
has been included along with the time taken for the sweeps. 
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Figure 4: Specific heat of an elastic polymer chain with lengths 13 and 55. The Figure 5: Fluctuation of the squared radius of gyration for chain lengths 13 and 



inset shows the icosahedral structure of the 55mer. 
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Figure 6: Speed-up factor S p vs. number of replicas «, for naive porting of 
CPU code to tlie GPU. Lines are only guides to tlie eye. 



Figure 7: Average GPU time (tcpu) in seconds vs. number of replicas n,- for 
the naive GPU version. 



For the speed comparison, only short runs of 4000 sweeps per 
replica were performed. Every 1000 sweeps the replicas were 
copied back to the host to propose a swap of temperatures be- 
tween neighboring copies according to Eq. (|6]l. The simulated 
system was of size N = 55. The specifications of the three 
tested GPUs are listed in Table [T] First, naive portings of CPU 
to GPU codes already show maximum speed-ups around 6-7 
for the GT200-based cards GPUl and GPU2 compared to the 
reference implementation on the CPU. With GPU3, based on 
the Fermi architecture, the maximum speed-up is about 9. Fig- 
ure |6] shows the dependency of the speed-ups on the number of 
replicas for different GPUs. In the naive approach, each replica 
of the system was assigned to one thread block containing only 
one thread. Because of the embarrassingly parallel nature of 
the parallel tempering algorithm, it is possible to outperform a 
single CPU when more than 36 replicas of a system are simu- 
lated on GPUl, 32 repHcas or 13 replicas for GPU2 or GPU3 
respectively. This is possible due to the large number of cores 
available on GPUs, even though their clock speeds are lower 
than that of modern CPUs. 

As mentioned in Section |2] the size of thread blocks is lim- 
ited and threads on the GPU are bundled to groups called warps. 
The maximum number of simultaneously active threads in a 
multiprocessor is 1024 (or 32 warps) for GT200-based cards 
(GPUl and GPU2) and 1536 threads (or 48 warps) for Fermi- 
based GPU3. These warps do not necessarily have to belong 
to the same thread block. Thus also 16 warps from 2 different 
thread blocks can be active simultaneously in a single SM, or 
3 blocks of 10 warps, or 4 blocks of 8 warps, and so on, up to 
8 blocks of 4 warps. This grouping of warps from up to 8 dif- 
ferent thread blocks is a current limitation of the GPU architec- 
ture. The SM is not able to gather as many warps from different 
thread blocks until its warp or thread limit is reached; explain- 
ing the peaks in Figure |6] Since there are 30 multiprocessors 
on GPUl and GPU2, the maximum number of active warps of 
the device is reached for 240 thread blocks of 1 thread. Each 
SM calculates the single threads of 8 diff'erent thread blocks. 
GPU3 however has 15 multiprocessors, thus it is only capable 



of executing 120 blocks at a time with 1 thread per block. The 
maximum speed-up for this thread layout is to be expected at 
multiples of 240 for GPUl and GPU2, and multiples of 120 for 
GPUl. Essentially thread blocks with threads other than multi- 
ples of 32 - the warp size - are not recommended at all. Another 
interesting finding is that the total runtime of the kernel is the 
same, whether only one multiprocessor is busy and the others 
are idling, or all multiprocessors are equally busy. This leads to 
a step-like graph, see Fig. |7l when plotting the kernel runtime 
versus the number of replica, i.e., the number of thread blocks. 
With this in mind, an improved version was implemented 
with a parallel calculation of the energy function, as shown in 
Listings [3] and m The thread block size in this version was set 
to 64 - a multiple of the warp size, for better scheduling - each 
block holding one replica. This improved version also exploits 
low-latency memory access by using shared memory for stor- 
ing the coordinates of the monomers. Thus all threads within 
a block have fast access to them when needed for the calcula- 
tion of their portion of the energy. Performance is also gained 
by substituting calls to standard C math library with optimized 
CUDA versions ll20ll . see Listing [3] line 3 and Listing|4]line 13. 
Those usually have a lower precision than their counter-parts, 
but are executed faster. As shown in Fig. [8] this implementa- 
tion is much faster than the CPU version, when more than 2 
replica are simulated. The maximum speed-up factor for GPUl 
is 68, for GPU2 it is 78 and for GPU3 even 130. Again, for 
the two GT200-based cards, multiples of 240 threads blocks 
are a limit for the maximum speed-up. With 240 active thread 
blocks of 64 threads each, there are 15360 threads running on 
the GPU. The total number of threads divided by the number 
of SMs in these cards equals 512. These 512 threads are a col- 
lection of 2 warps from 8 different thread blocks. So, with this 
parametrization, the occupancy of the multiprocessors is only 
at 50%, since 1024 active threads per SM are possible with 
GT200-cards. For GPU3, the first maximum is at 120 thread 
blocks, which complies with a total number of 7680 threads. 
Even though GPU3 is capable of running 1536 threads in each 
of its 15 multiprocessors at a time, only 512 threads are active. 
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Figure 8: Speed-up factor S p vs. number of replicas n^ for the GPU version 
with parallelized energy calculation. Drawn lines are only guides to the eye. 



due to the 8 thread block limitation (this equals an occupancy of 
33%). These occupancy values come from the fact that only 64 
threads per replica are used for the energy calculation. A differ- 
ent implementation with more threads or bigger systems sizes, 
which require more threads, could increase the occupancy of 
the multiprocessors and thus increase the efficiency. The step- 
like shape of the curve in Fig.|9]every 30 replica for GPUl and 
GPU2 coincides with the number of SMs, meaning that with ev- 
ery additional thread block the overall speed-up drops until each 
of the 30 multiprocessors again is equally busy. Also noticeable 
is that for GPUl and GPU2 the first maximum of the speed-up 
factor is reached for 120 thread blocks. That means all cores 
are equally busy, but apparently there seems to be plenty of la- 
tency in memory operations. Thus, the overall speed-up is not 
affected by adding the same amount of work to each multipro- 
cessor, up to 240 thread blocks in total. For GPU3 the increase 
in kernel runtime occurs every 15 thread blocks, corresponding 
to the number of SMs in the given card. Consequently, to max- 
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Table 2: Overview of maximum achieved speed-ups - max I5,,(«,-)) - for the 
two diiferent GPU implementations, compared to the single-core CPU imple- 
mentation. 
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Figure 9: Average GPU time (topu) in seconds vs. number of replicas n,- for 
the improved GPU version. 



imize the benefit from GPU implementations it is necessary to 
keep all multiprocessors on the GPU equally busy. 

Table |2] is a summary of the maximum speed-up factors 
achieved in our simulations, employing the two different im- 
plementations. The ratio of speed-ups from GPUl and GPU2 
is nearly the same as the ratio of their clock speeds. Whereas 
GPU3 with a similar clock speed shows significant speed-ups, 
which originate from the difference in the chip design of the 
two GPU generations. 



5. Summary 

In this paper, we have shown that replica-exchange Monte 
Carlo simulations of off-lattice polymer models can be per- 
formed quite efficiently on GPUs. Even for off-lattice poly- 
mer models this is a suitable approach. Already with a very 
simple naive porting of CPU code to the GPU, we find con- 
siderable performance gains of factors about 6-9 compared to 
a single CPU implementation. Utilizing the unique architec- 
ture of GPUs, with its different memory layers and the abil- 
ity to schedule a massive amount of threads, we improved the 
GPU program to attain speed-up factors of around 70 for main- 
stream GPUs like the GTX285, and even factors up to 130 
with NVIDIA's new generation Fermi-based GTX480 card. It 
should be noted that our implementations represent a rather ba- 
sic level of utilizing the advantages of GPU architectures. 

Furthermore it is possible to access multiple cards in a sin- 
gle workstation from one and the same program with no ex- 
tra effort. Also nodes of established cluster computers can be 
equipped with GPUs, a combination of the traditional message 
passing interface (MPI) and CUDA is used in such a scenario. 
Thus GPUs promise great gains in productivity and might help 
building the next-generation supercomputers. 
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